Using Standard Methods
Let’s see what happens when we apply standard methods to data that just doesn’t work that way. First we’ll start with some generic data to illustrate the point.
We can use a basic linear regression and see how things go.
Call:
lm(formula = y ~ x1 + x2, data = dat)
Residuals:
Min 1Q Median 3Q Max
-8.6634 -1.3220 -0.0571 1.5297 7.0937
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.2745 0.3332 21.83 <2e-16 ***
x1 6.3616 0.4352 14.62 <2e-16 ***
x2 -5.1214 0.4526 -11.31 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.532 on 397 degrees of freedom
Multiple R-squared: 0.4594, Adjusted R-squared: 0.4567
F-statistic: 168.7 on 2 and 397 DF, p-value: < 2.2e-16
Everything is nice and tidy. We have straightforward information, positive effect of x1, negative for x2, and familiar output. Let’s look at some diagnostics.
Some issues might be present, as we might be getting a little more variance with some, especially higher, fitted values. We’re also a little loose in the tails of the distribution of the residuals. Let’s compare our predictions to the data. With a strong model, we might see a cigar shaped cloud converging to a line with slope 1 as the fit gets better. We seem to be having some issues here, as the residual plot noted above.
Now let’s go back and visualize the data. The following plots both predictors against the target variable.
So even with diagnostics that didn’t appear too bad, we appear to not be capturing what’s going on with the data that well at all!
Heteroscedasticity, non-normality, etc.
We can try a log transformation. But this won’t help us.
Call:
lm(formula = log(y) ~ x1 + x2, data = dat)
Residuals:
Min 1Q Median 3Q Max
-3.07226 -0.13090 0.05803 0.22780 0.79379
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.82673 0.05398 33.839 <2e-16 ***
x1 0.93691 0.07050 13.290 <2e-16 ***
x2 -0.68719 0.07333 -9.371 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4102 on 397 degrees of freedom
Multiple R-squared: 0.3968, Adjusted R-squared: 0.3937
F-statistic: 130.6 on 2 and 397 DF, p-value: < 2.2e-16



Polynomial Regression
Likewise, we can try polynomial regression, but for most situations, the functional form is just something we might guess at, and/or it still won’t help. Let’s look at a fairly complex relationship.
Polynomial regression is problematic
A standard linear regression is definitely not going to capture this relationship. As above, we could try and use polynomial regression here, e.g. fitting a quadratic or cubic function within the standard regression framework. However, this is unrealistic at best and at worst isn’t useful for complex relationships. In the following, even with a polynomial of degree 15 the fit is fairly poor in many areas, and ‘wiggles’ in some places where there doesn’t appear to be a need to.
fits = sapply(seq(3,15, 3), function(p) fitted(lm(y~poly(x,p)))) %>%
data.frame(x, y, .) %>%
gather(key=polynomial, value=fits, -x, -y) %>%
mutate(polynomial = factor(polynomial, labels=seq(3,15, 3)))
plot_ly(data=d) %>%
add_markers(~x, ~y, marker=list(color='#ff5500', opacity=.2), showlegend=F) %>%
add_lines(~x, ~fits, color=~polynomial, data=fits) %>%
theme_plotly()
GAM to the rescue!
Ignoring unknown parameters: method, formula

LS0tCnRpdGxlOiAiVGhlIENhc2UgZm9yIEdBTXMiCm91dHB1dDogCiAgaHRtbF9ub3RlYm9vazogCiAgICBjb2RlX2ZvbGRpbmc6IG5vbmUKICAgIGhpZ2hsaWdodDogcHlnbWVudHMKICAgIHRoZW1lOiBzYW5kc3RvbmUKZWRpdG9yX29wdGlvbnM6IAogIGNodW5rX291dHB1dF90eXBlOiBpbmxpbmUKLS0tCgojIyBJbml0aWFsaXphdGlvbgoKYGBge3IgbWlzY19mdW5jdGlvbnN9CnNvdXJjZSgnbWlzY19mdW5jdGlvbnMvZnVuY3Rpb25zLlInKQpgYGAKCmBgYHtyIGxvYWRfcGFja2FnZXN9CmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KHBsb3RseSkKbGlicmFyeShtb2RlbHIpCmxpYnJhcnkobWdjdikKYGBgCgoKIyMgVXNpbmcgU3RhbmRhcmQgTWV0aG9kcwoKTGV0J3Mgc2VlIHdoYXQgaGFwcGVucyB3aGVuIHdlIGFwcGx5IHN0YW5kYXJkIG1ldGhvZHMgdG8gZGF0YSB0aGF0IGp1c3QgZG9lc24ndCB3b3JrIHRoYXQgd2F5LiAgRmlyc3Qgd2UnbGwgc3RhcnQgd2l0aCBzb21lIGdlbmVyaWMgZGF0YSB0byBpbGx1c3RyYXRlIHRoZSBwb2ludC4KCmBgYHtyIHNpbXVsYXRlZF9kYXRhfQpzZXQuc2VlZCgxMjMpCmxpYnJhcnkobWdjdikKZGF0ID0gZ2FtU2ltKDEsIG49NDAwLCBkaXN0PSJub3JtYWwiLCBzY2FsZT0xLCB2ZXJib3NlPUYpCmBgYAoKV2UgY2FuIHVzZSBhIGJhc2ljIGxpbmVhciByZWdyZXNzaW9uIGFuZCBzZWUgaG93IHRoaW5ncyBnby4KCmBgYHtyIGRlbW9sbX0KbW9kID0gbG0oeSB+IHgxICsgeDIsIGRhdGE9ZGF0KQpzdW1tYXJ5KG1vZCkKYGBgCgpFdmVyeXRoaW5nIGlzIG5pY2UgYW5kIHRpZHkuIFdlIGhhdmUgc3RyYWlnaHRmb3J3YXJkIGluZm9ybWF0aW9uLCBwb3NpdGl2ZSBlZmZlY3Qgb2YgeDEsIG5lZ2F0aXZlIGZvciB4MiwgYW5kIGZhbWlsaWFyIG91dHB1dC4gTGV04oCZcyBsb29rIGF0IHNvbWUgZGlhZ25vc3RpY3MuCgpgYGB7ciBkaWFnbm9zdGljc30KcGxvdChtb2QsIHdoaWNoID0gMToyKQpgYGAKClNvbWUgaXNzdWVzIG1pZ2h0IGJlIHByZXNlbnQsIGFzIHdlIG1pZ2h0IGJlIGdldHRpbmcgYSBsaXR0bGUgbW9yZSB2YXJpYW5jZSB3aXRoIHNvbWUsIGVzcGVjaWFsbHkgaGlnaGVyLCBmaXR0ZWQgdmFsdWVzLiBXZeKAmXJlIGFsc28gYSBsaXR0bGUgbG9vc2UgaW4gdGhlIHRhaWxzIG9mIHRoZSBkaXN0cmlidXRpb24gb2YgdGhlIHJlc2lkdWFscy4gTGV04oCZcyBjb21wYXJlIG91ciBwcmVkaWN0aW9ucyB0byB0aGUgZGF0YS4gV2l0aCBhIHN0cm9uZyBtb2RlbCwgd2UgbWlnaHQgc2VlIGEgY2lnYXIgc2hhcGVkIGNsb3VkIGNvbnZlcmdpbmcgdG8gYSBsaW5lIHdpdGggc2xvcGUgMSBhcyB0aGUgZml0IGdldHMgYmV0dGVyLiBXZSBzZWVtIHRvIGJlIGhhdmluZyBzb21lIGlzc3VlcyBoZXJlLCBhcyB0aGUgcmVzaWR1YWwgcGxvdCBub3RlZCBhYm92ZS4KCmBgYHtyIGdlbmVyYWxfZml0fQojIHJlcXVpcmVzIGJyb29tIGFuZCBnbHVlIHBhY2thZ2VzCmJyb29tOjphdWdtZW50KG1vZCkgJT4lIAogIGdncGxvdChhZXMoeD0uZml0dGVkLCB5PXkpKSArCiAgZ2VvbV9wb2ludChhbHBoYT0uMjUsIGNvbG9yPScjZmY1NTAwJykgKyAKICBnZW9tX3Ntb290aChzZT1GLCBjb2xvcj0nIzAwYWFmZicpICsKICBhbm5vdGF0ZSgndGV4dCcsCiAgICAgICAgICAgbGFiZWwgPSBnbHVlOjpnbHVlKCJSc3EgPSB7cm91bmQoc3VtbWFyeShtb2QpJHIuc3F1YXJlZCwgMil9IiksCiAgICAgICAgICAgeCA9IDQsCiAgICAgICAgICAgeSA9IDE2KSArCiAgbGFicyh0aXRsZT0nRml0dGVkIHZzLiBPYnNlcnZlZCcpICsKICB0aGVtZV9taW5pbWFsKCkKYGBgCgpOb3cgbGV04oCZcyBnbyBiYWNrIGFuZCB2aXN1YWxpemUgdGhlIGRhdGEuIFRoZSBmb2xsb3dpbmcgcGxvdHMgYm90aCBwcmVkaWN0b3JzIGFnYWluc3QgdGhlIHRhcmdldCB2YXJpYWJsZS4KCmBgYHtyIHJlbGF0aW9uc2hpcHN9CmRhdCAlPiUgCiAgc2VsZWN0KHgxLCB4MiwgeSkgJT4lIAogIGdhdGhlcihrZXk9dmFyaWFibGUsIHZhbHVlPVByZWRpY3RvciwgLXkpICU+JSAKICBnZ3Bsb3QoYWVzKHg9UHJlZGljdG9yLCB5PXkpKSArCiAgZ2VvbV9wb2ludChhbHBoYT0uMjUsIGNvbG9yPScjZmY1NTAwJykgKyAKICBnZW9tX3Ntb290aChhZXMoKSwgY29sb3I9JyMwMGFhZmYnLCBzZT1GKSArCiAgZmFjZXRfZ3JpZCh+dmFyaWFibGUpICsgCiAgbGFicyh0aXRsZT0nUHJlZGljdG9ycyB2cy4gWScpICsKICB0aGVtZV90cnVlTWluaW1hbCgpCmBgYAoKU28gZXZlbiB3aXRoIGRpYWdub3N0aWNzIHRoYXQgZGlkbid0IGFwcGVhciB0b28gYmFkLCB3ZSBhcHBlYXIgdG8gbm90IGJlIGNhcHR1cmluZyB3aGF0J3MgZ29pbmcgb24gd2l0aCB0aGUgZGF0YSB0aGF0IHdlbGwgYXQgYWxsIQoKIyMgSGV0ZXJvc2NlZGFzdGljaXR5LCBub24tbm9ybWFsaXR5LCBldGMuCgpXZSBjYW4gdHJ5IGEgbG9nIHRyYW5zZm9ybWF0aW9uLiBCdXQgdGhpcyB3b24ndCBoZWxwIHVzLgoKYGBge3IgbG9nX25vX2hlbHB9Cm1vZGxvZyA9IGxtKGxvZyh5KSB+IHgxICsgeDIsIGRhdCkKc3VtbWFyeShtb2Rsb2cpCnBsb3QobW9kbG9nLCB3aGljaCA9IDE6MikKCmJyb29tOjphdWdtZW50KG1vZCkgJT4lIAogIGdncGxvdChhZXMoeD0uZml0dGVkLCB5PXkpKSArCiAgZ2VvbV9wb2ludChhbHBoYT0uMjUsIGNvbG9yPScjZmY1NTAwJykgKyAKICBnZW9tX3Ntb290aChzZT1GLCBjb2xvcj0nIzAwYWFmZicpICsKICBhbm5vdGF0ZSgndGV4dCcsCiAgICAgICAgICAgbGFiZWwgPSBnbHVlOjpnbHVlKCJSc3EgPSB7cm91bmQoc3VtbWFyeShtb2Rsb2cpJHIuc3F1YXJlZCwgMil9IiksCiAgICAgICAgICAgeCA9IDQsCiAgICAgICAgICAgeSA9IDE2KSArCiAgbGFicyh0aXRsZT0nRml0dGVkIHZzLiBPYnNlcnZlZCcpICsKICB0aGVtZV90cnVlTWluaW1hbCgpCmBgYAoKCiMjIFBvbHlub21pYWwgUmVncmVzc2lvbgoKTGlrZXdpc2UsIHdlIGNhbiB0cnkgcG9seW5vbWlhbCByZWdyZXNzaW9uLCBidXQgZm9yIG1vc3Qgc2l0dWF0aW9ucywgdGhlIGZ1bmN0aW9uYWwgZm9ybSBpcyBqdXN0IHNvbWV0aGluZyB3ZSBtaWdodCBndWVzcyBhdCwgYW5kL29yIGl0IHN0aWxsIHdvbid0IGhlbHAuICBMZXQncyBsb29rIGF0IGEgZmFpcmx5IGNvbXBsZXggcmVsYXRpb25zaGlwLgoKYGBge3Igc2ltRGF0YX0Kc2V0LnNlZWQoMTIzKQp4ID0gcnVuaWYoNTAwKQptdSA9IHNpbigyICogKDQgKiB4IC0gMikpICsgMiAqIGV4cCgtKDE2IF4gMikgKiAoKHggLSAuNSkgXiAyKSkKeSA9IHJub3JtKDUwMCwgbXUsIC4zKQpkID0gZGF0YS5mcmFtZSh4LHkpIApgYGAKCgojIyMjIFBvbHlub21pYWwgcmVncmVzc2lvbiBpcyBwcm9ibGVtYXRpYwoKQSBzdGFuZGFyZCBsaW5lYXIgcmVncmVzc2lvbiBpcyBkZWZpbml0ZWx5IG5vdCBnb2luZyB0byBjYXB0dXJlIHRoaXMgcmVsYXRpb25zaGlwLiAgQXMgYWJvdmUsIHdlIGNvdWxkIHRyeSBhbmQgdXNlIHBvbHlub21pYWwgcmVncmVzc2lvbiBoZXJlLCBlLmcuIGZpdHRpbmcgYSBxdWFkcmF0aWMgb3IgY3ViaWMgZnVuY3Rpb24gd2l0aGluIHRoZSBzdGFuZGFyZCByZWdyZXNzaW9uIGZyYW1ld29yay4gIEhvd2V2ZXIsIHRoaXMgaXMgdW5yZWFsaXN0aWMgYXQgYmVzdCBhbmQgYXQgd29yc3QgaXNuJ3QgdXNlZnVsIGZvciBjb21wbGV4IHJlbGF0aW9uc2hpcHMuIEluIHRoZSBmb2xsb3dpbmcsIGV2ZW4gd2l0aCBhIHBvbHlub21pYWwgb2YgZGVncmVlIDE1IHRoZSBmaXQgaXMgZmFpcmx5IHBvb3IgaW4gbWFueSBhcmVhcywgYW5kICd3aWdnbGVzJyBpbiBzb21lIHBsYWNlcyB3aGVyZSB0aGVyZSBkb2Vzbid0IGFwcGVhciB0byBiZSBhIG5lZWQgdG8uCgpgYGB7ciBzaW1EYXRhUGxvdCwgbWVzc2FnZT1GfQpmaXRzID0gc2FwcGx5KHNlcSgzLDE1LCAzKSwgZnVuY3Rpb24ocCkgZml0dGVkKGxtKHl+cG9seSh4LHApKSkpICU+JSAKICBkYXRhLmZyYW1lKHgsIHksIC4pICU+JSAKICBnYXRoZXIoa2V5PXBvbHlub21pYWwsIHZhbHVlPWZpdHMsIC14LCAteSkgJT4lIAogIG11dGF0ZShwb2x5bm9taWFsID0gZmFjdG9yKHBvbHlub21pYWwsIGxhYmVscz1zZXEoMywxNSwgMykpKQoKcGxvdF9seShkYXRhPWQpICU+JSAKICBhZGRfbWFya2Vycyh+eCwgfnksIG1hcmtlcj1saXN0KGNvbG9yPScjZmY1NTAwJywgb3BhY2l0eT0uMiksIHNob3dsZWdlbmQ9RikgJT4lIAogIGFkZF9saW5lcyh+eCwgfmZpdHMsIGNvbG9yPX5wb2x5bm9taWFsLCBkYXRhPWZpdHMpICU+JSAKICB0aGVtZV9wbG90bHkoKQpgYGAKCgoKIyMgR0FNIHRvIHRoZSByZXNjdWUhCgpgYGB7ciBnYW1fdnNfcG9seX0KZCAlPiUgCiAgcXBsb3QoZGF0YT0uLCAKICAgICAgICB4ID0geCwgCiAgICAgICAgeSA9IHksIAogICAgICAgIGdlb209YygncG9pbnQnLCAnc21vb3RoJyksIAogICAgICAgIG1ldGhvZCA9ICdnYW0nLAogICAgICAgIGZvcm11bGEgPSB5IH4gcyh4KSkKYGBgCgo=